Assignment 1 Recap

To recap, in Assignment 1, I (1) analysed information about the dataset I have chosen, and its associated publication, (2) mapped the expression data to HUGO gene symbols, (3) filtered out unncecessary genes, and (4) normalized the dataset. The following subsections go over these points in a bit more detail, and are taken from my Assignment 1. In addition, the code from my .rmd file includes code from A1- in order to regenerate the normalized counts matrix.

Brief Introduction to the Data

The expression dataset that I had chosen is GSE221253, taken from the GEO expression data repository. The study associated with this dataset involved adoptive cell therapy. Adoptive cell therapy via tumor infiltrating lymphocytes (ACT-TIL) is a type of immunotherapy used to treat cancer. This study took samples from 13 patients having metastatic melanoma and performed RNAseq analysis, along with other analyses like spatial proteomics and scRNAseq, on tumor tissues before and after the cell therapy (pre- and post-ACT) in order to gain a better understanding of the interactions and cell states within the tumor microenvironment throughout treatment. For the RNAseq experiment, this resulted in a total of 26 samples, and 13 in each condition (two for each patient, and the experimental conditions being pre- and post-ACT treatment).

Mapping to HUGO Gene Symbols

The first step from Assignment 1 was to map the expression data to HUGO gene symbols. Only 940 out of 19117 genes from our original gene expression raw file were unable to be mapped to HUGO gene symbols. This is actually not that bad!

Dataset Filteration and Normalization

The next step in Assignment 1 was to filter out the genes with low counts and normalize our data. Filteration of outlier genes with low gene counts removed ~3678 genes from the expression dataset. Then, TMM normalization was performed to normalize the data. The plots (pre and post normalization + filtration) are shown below:

Plot of the Data Before Normalization and Filtration
Plot of the Data Before Normalization and Filtration
Plot of the Data After Normalization and Filtration
Plot of the Data After Normalization and Filtration

Assignment 2 Recap

To recap, the first step of Assignment 2 is to perform differential gene expression, in order to identify key genes that are differentially expressed between our subgroups. This involves (1) constructing our model design, (2) estimating the dispersion, (3) performing multiple hypothesis testing and using the Quasi likelihood model to calculate differential expression, and (4) generate a heatmap and MA plot of the data. Next, we performed threshold over-representation analysis

Model Design

First, this involved defining our model design, and inspecting the MDS plot (below) to see if there was any clustering between our two groups (pre-ACT and post-ACT). It doesn’t seem that there is a lot of clustering amongst the pre-ACT and post-ACT samples. However, the study associated with our data set aims to understand interactions and cell states within the tumor microenvironment throughout treatment, so it makes the most sense to define the group for our model design as pre- and post-ACT treatment.

Estimate Dispersion + Top Gene Hits

Here, we estimate the dispersion, fit the model to our specified model design, and then calculate differential expression using the Quasi liklihood model. The table below shows the top gene hits.

logFC logCPM F PValue FDR
ALB -14.893974 11.900720 40.50206 7.0e-07 0.0018536
FOSB 3.791452 5.261109 39.29955 9.0e-07 0.0018536
FOS 2.600458 7.410897 38.90495 1.0e-06 0.0018536
FGB -13.907270 10.372034 38.81796 1.1e-06 0.0018536
FGA -13.295751 9.650812 38.26330 1.2e-06 0.0018536
ZC2HC1B -2.871639 1.604275 37.93189 1.3e-06 0.0018536
GC -11.892857 7.806030 37.19659 1.5e-06 0.0018536
APOH -11.785979 7.789941 37.07206 1.5e-06 0.0018536
KNG1 -11.937946 8.299449 37.02790 1.5e-06 0.0018536
ARG1 -8.562373 5.402457 36.46001 1.8e-06 0.0018536
x
1 BH
x
1 samplesInfoMatrix$timepre-ACT
x
1 glm

Multiple Hypothesis Testing

In the below code blocks, we compile all of the results, and find how many genes pass the threshold value, and how many genes pass correction.

How many genes pass the p-value threshold p < 0.05?

## [1] 4684

Now, let us see how many genes pass correction for multiple hypothesis testing. ~25% of the genes pass correction. This is a decent number, as it is not too much or not too little genes.

## [1] 1643

MA Plot

Here is an MA plot that displays the differential gene expression between our two groups of samples: pre- and post- ACT.

Heat Map

The following codeblock generates a heatmap of our data, along with annotations, to see how the data seems to cluster. Based on what is shown in the heatmap below, there does seem to be some minimal clustering of genes that are overexpressed in some pre-ACT samples.

Thresholded Over Representation Analysis

The last step of this assignment is to perform thresholded over-representation analysis (ORA). Essentially, our main goal here is to see whether our upregulated or downregulated genes all share some kind of common characteristic (maybe some of them are from the same pathway, etc). By doing this, we can then make conclusions about what kind of genes are overexpressed/underexpressed and ensure that it aligns with literature.

Define All, Upregulated, Downregulated genes

Here, we define our upregulated and downregulated genes. Upregulated genes are defined here to have a p-value of 0.05 and positive fold change. Downregulated genes are defined here to have a p-value of 0.05 and negative fold change.

## [1] "Number of upregulated genes"
## [1] 1170
## [1] "Number of downregulated genes"
## [1] 473
## [1] "Number of significant genes in total"
## [1] 1643

Running G:profiler on All Significant Genes

Here, we run run g:profiler on the set of all significant genes that have a p-value of < 0.05 to perform a gene set enrichment analysis.

query significant p_value term_size query_size intersection_size precision recall term_id source term_name effective_domain_size source_order parents
query_1 TRUE 6.630947e-07 1494 1393 199 0.14285714 0.1331995 GO:0044281 GO:BP small molecule metabolic process 16178 11317 GO:0008152
query_1 TRUE 2.189967e-05 2145 1393 257 0.18449390 0.1198135 GO:0009056 GO:BP catabolic process 16178 3269 GO:0008152
query_1 TRUE 5.598736e-05 4941 1393 517 0.37114142 0.1046347 GO:1901564 GO:BP organonitrogen compound metabolic process 16178 22098 GO:00068….
query_1 TRUE 7.868727e-05 267 1393 51 0.03661163 0.1910112 GO:0016053 GO:BP organic acid biosynthetic process 16178 5145 GO:00060….
query_1 TRUE 9.388388e-05 764 1393 109 0.07824838 0.1426702 GO:0006082 GO:BP organic acid metabolic process 16178 1967 GO:00442….
query_1 TRUE 9.388388e-05 264 1393 50 0.03589375 0.1893939 GO:0046394 GO:BP carboxylic acid biosynthetic process 16178 12522 GO:00160….

Running G:profiler on the Upregulated Gene Set

In the codeblock below, we run g:profiler on the set of upregulated genes to perform a gene set enrichment analysis.

query significant p_value term_size query_size intersection_size precision recall term_id source term_name effective_domain_size source_order parents
query_1 TRUE 0.004049154 88 997 18 0.018054162 0.20454545 GO:0042775 GO:BP mitochondrial ATP synthesis coupled electron transport 16178 10709 GO:0042773
query_1 TRUE 0.004049154 4941 997 368 0.369107322 0.07447885 GO:1901564 GO:BP organonitrogen compound metabolic process 16178 22098 GO:00068….
query_1 TRUE 0.004049154 88 997 18 0.018054162 0.20454545 GO:0042773 GO:BP ATP synthesis coupled electron transport 16178 10707 GO:00061….
query_1 TRUE 0.004049154 8726 997 605 0.606820461 0.06933303 GO:0044238 GO:BP primary metabolic process 16178 11300 GO:0008152
query_1 TRUE 0.004049154 428 997 51 0.051153460 0.11915888 GO:0007005 GO:BP mitochondrion organization 16178 2648 GO:0006996
query_1 TRUE 0.004049154 6 997 5 0.005015045 0.83333333 GO:0036115 GO:BP fatty-acyl-CoA catabolic process 16178 9776 GO:00091….
query_1 TRUE 0.004049154 46 997 13 0.013039117 0.28260870 GO:0006120 GO:BP mitochondrial electron transport, NADH to ubiquinone 16178 1998 GO:00196….
query_1 TRUE 0.004049154 160 997 26 0.026078235 0.16250000 GO:0009060 GO:BP aerobic respiration 16178 3273 GO:0045333
query_1 TRUE 0.004244812 8309 997 579 0.580742227 0.06968348 GO:0006807 GO:BP nitrogen compound metabolic process 16178 2516 GO:0008152
query_1 TRUE 0.004266470 9757 997 666 0.668004012 0.06825869 GO:0008152 GO:BP metabolic process 16178 3159 GO:0008150
query_1 TRUE 0.004353541 99 997 19 0.019057172 0.19191919 GO:0022904 GO:BP respiratory electron transport chain 16178 6868 GO:00229….
query_1 TRUE 0.004460309 186 997 28 0.028084253 0.15053763 GO:0045333 GO:BP cellular respiration 16178 11755 GO:0015980
query_1 TRUE 0.004460309 490 997 55 0.055165496 0.11224490 GO:0061919 GO:BP process utilizing autophagic mechanism 16178 16215 GO:0009987
query_1 TRUE 0.004460309 490 997 55 0.055165496 0.11224490 GO:0006914 GO:BP autophagy 16178 2592 GO:00442….
query_1 TRUE 0.004460309 59 997 14 0.014042126 0.23728814 GO:0072332 GO:BP intrinsic apoptotic signaling pathway by p53 class mediator 16178 17869 GO:00723….
query_1 TRUE 0.005219212 61 997 14 0.014042126 0.22950820 GO:0010257 GO:BP NADH dehydrogenase complex assembly 16178 4072 GO:0065003
query_1 TRUE 0.005219212 61 997 14 0.014042126 0.22950820 GO:0032981 GO:BP mitochondrial respiratory chain complex I assembly 16178 8243 GO:00102….
query_1 TRUE 0.005971288 1168 997 107 0.107321966 0.09160959 GO:0009057 GO:BP macromolecule catabolic process 16178 3270 GO:00431….
query_1 TRUE 0.007634004 2145 997 176 0.176529589 0.08205128 GO:0009056 GO:BP catabolic process 16178 3269 GO:0008152
query_1 TRUE 0.007634004 107 997 19 0.019057172 0.17757009 GO:0022900 GO:BP electron transport chain 16178 6867 GO:0006091

Running G:profiler on the Downregulated Gene Set

Here, we do the same for the set of downregulated genes, and perform a gene set enrichment analysis on them.

query significant p_value term_size query_size intersection_size precision recall term_id source term_name effective_domain_size source_order parents
query_1 TRUE 1.545313e-14 1494 396 94 0.23737374 0.06291834 GO:0044281 GO:BP small molecule metabolic process 16178 11317 GO:0008152
query_1 TRUE 2.470354e-14 764 396 63 0.15909091 0.08246073 GO:0006082 GO:BP organic acid metabolic process 16178 1967 GO:00442….
query_1 TRUE 1.688262e-13 758 396 61 0.15404040 0.08047493 GO:0043436 GO:BP oxoacid metabolic process 16178 11048 GO:0006082
query_1 TRUE 1.688262e-13 739 396 60 0.15151515 0.08119080 GO:0019752 GO:BP carboxylic acid metabolic process 16178 6252 GO:0043436
query_1 TRUE 5.379899e-11 299 396 34 0.08585859 0.11371237 GO:0044282 GO:BP small molecule catabolic process 16178 11318 GO:00090….
query_1 TRUE 5.379899e-11 24 396 12 0.03030303 0.50000000 GO:0042730 GO:BP fibrinolysis 16178 10679 GO:0030195

Gene Sets Returned from Downregulated and Upregulated Analyses

Here, we present the number of gene sets returned from both the downregulated and upregulated analyses:

## [1] "How many genesets were returned from downregulated analysis?"
## [1] 5057
## [1] "How many genesets were returned from upregulated analysis"
## [1] 7844

Interpretation

The authors in the original paper concluded that in samples after Adoptive Cell Therapy, there was an increase in the strength and number of tumor infiltrating lymphocytes (TILSs), as well as an increase in interferon-activated myeloid T-cells (Barras 2024). This suggests that there seems to be an overall improved immune response towards the tumor after cell therapy. Upon looking at the over-representation results for down-regulated and up-regulated genes, however, there does not seem to be much correlation between the papers findings and the over-representation results. The overrepresentation results list common metabolic process such as “small molecule metabolic process” and “mitochondrial ATP synthesis coupled electron transport” as top hits, but these are not indicative of anything as they are very general/broad processes.

Assignment 3

In this section, we begin assignment 3. The first step is to perform non-threshold Gene set Enrichment Analysis. Next, we will visualize our GSEA using Cytoscape. Lastly, we will interpret and have a more granular understanding of our results.

Non-thresholded Gene set Enrichment Analysis

There are a lot of existing gene set analysis algorithm that are non-thresholded, but we will be using GSEA, as it is very popular, and well trusted throughout literature (the number of citations it has tops all other methods). We will also be using gene sets from the BaderLab.

Non-Threshold GSEA- Interpretation Questions/Answers

1.What method did you use? What genesets did you use? Make sure to specify versions and cite your methods. There are a lot of existing gene set analysis algorithm that are non-thresholded, but I decided to use GSEA, as it is very popular, and well trusted throughout literature (the number of citations it has tops all other methods). I also used gene sets from the BaderLab. I followed the methods outlined here

2.Summarize your enrichment results. Upon opening the index file outputted by GSEA, we can see that:

4263 / 5903 gene sets are upregulated in phenotype Pre- Adoptive Cell Therapy,

1049 gene sets are significant at FDR < 25%, 445 gene sets are significantly enriched at nominal pvalue < 1%,

944 gene sets are significantly enriched at nominal pvalue < 5%,

1640 / 5903 gene sets are upregulated in phenotype Post- Adoptive Cell Therapy,

682 gene sets are significantly enriched at FDR < 25%,

441 gene sets are significantly enriched at nominal pvalue < 1%,

605 gene sets are significantly enriched at nominal pvalue < 5%

3. How do these results compare to the results from the thresholded analysis in Assignment #2. Compare qualitatively. Is this a straight forward comparison? Why or why not? When performing thresholded over-representation analysis in Assignment 2, we first had to filter out the genes based on their p-values, to ensure that we only had significant up-regulated and down-regulated genes, resulting in an overall lower number of genes than GSEA. For GSEA, we end up using all of the genes, regardless of their p-value, to ensure that we dont loose any signal. So, no it is not really a straightforward comparison due to the different methodologies used (with a threshold vs. without one).

Visualizing GSEA in Cytoscape

Our next step is to use our results from non-thresholded GSEA and visualize our results in Cytoscape. In these figures here, a node represents a gene set, and an edge represents genes in common between two gene sets. Red nodes represents gene sets specifically for the Pre-ACT group, and blue nodes represents gene sets specifically for the Post-ACT group. The three figures below show the main clusters present within the network.

First Group of Clusters in the Cytoscape Network with Pathway Names
First Group of Clusters in the Cytoscape Network with Pathway Names
Second Group of Clusters in the Cytoscape Network with Pathway Names
Second Group of Clusters in the Cytoscape Network with Pathway Names
Third Group of Clusters in the Cytoscape Network with Pathway Names
Third Group of Clusters in the Cytoscape Network with Pathway Names

We can actually visualize the different clusters of this network, and identify common themes/common gene sets (pathways) and biological processes. These cluster themes for each of the above figures are presented below (in the above order):

First Group of Clusters in the Cytoscape Network with Prominent Themes
First Group of Clusters in the Cytoscape Network with Prominent Themes
Second Group of Clusters in the Cytoscape Network with Prominent Themes
Second Group of Clusters in the Cytoscape Network with Prominent Themes
Third Group of Clusters in the Cytoscape Network with Prominent Themes
Third Group of Clusters in the Cytoscape Network with Prominent Themes

Visualize GSEA in Cytoscape- Interpretation Questions/Answers

1.Create an enrichment map - how many nodes and how many edges in the resulting map? What thresholds were used to create this map? Make sure to record all thresholds. Include a screenshot of your network prior to manual layout. Enrichment map is created above in the above figures. There are a total of 883 nodes and 6064 edges. To generate by map, I used a node cutoff threshold of Q-value = 0.01, and for my edge cutoff, I used a threshold value of Q-value = 0.375. I used default parameters for the rest of the fields.

2.Annotate your network - what parameters did you use to annotate the network. If you are using the default parameters make sure to list them as well. For my annotated network (figures above), I decided to annotate each gene set with its pathway name. The rest of the parameters I used for this were default.

3.Make a publication ready figure - include this figure with proper legends in your notebook. My publication ready figure with a figure legend is below:

Publication Ready Figure
Publication Ready Figure

4.Collapse your network to a theme network. What are the major themes present in this analysis? Do they fit with the model? Are there any novel pathways or themes? My above publication ready figure is organized in terms of the major themes presented. From the image, we see that there is a lot of pathways associated with processes: “coagulation cascades action”, “pathway valdecoxib action”, and “plasma lipoprotein particle”. There are other themes present, but these are just the largest ones. The major themes presented here could sort of agree with the model, since in the original paper, the authors concluded that in samples post Adoptive Cell Therapy,there was an increase in the strength and number of tumor infiltrating lymphocytes (TILSs), as well as an increase in interferon-activated myeloid T-cells (Barras 2024), and from the themes it seems that valdecoxib action pathways, which mediate innflamation and pain were very prevalent in the post-ACT group. This could potentially correlate with the papers findings as cell therapy could also treat inflammation and pain. From my understanding, there are no novel pathways or themes present.

Interpretation and detailed view of results

1.Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods The enrichment map from GSEA, according to my enrichment map from cytoscape does sort of support conclusions discussed in the original paper. As I mentioned above, the original paper concluded that in samples post Adoptive Cell Therapy (ACT), there was an increase in the strength and number of tumor infiltrating lymphocytes (TILSs), as well as an increase in interferon-activated myeloid T-cells (Barras 2024). This means that post-ACT treatment, the bodies immune system has become stronger. The enrichment map shows a predominant cluster of pathways under the theme “valdecoxib action pathways”, and upon doing some literature search, I found that valdecoxib is essentially a drug that prevents inflammation and pain, which could have been added to the cell therapy treatment. I tried to look into literature to see whether there was any use of valdecoxcib in ACT treatment, but was unable to find much.

In regards to comparing these results with the results from assignment 2’s thresholded g:profiler methods, there was not much of a correlation. Common pathways presented in A2 were apoptotic pathways and molecular metabolic processes, neither of which really correlate with the big themes presented in the GSEA enrichment map.

2.Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result? As I mentioned above, valdecoxcib action pathways was a major theme presented in my enrichment map above. It is a drug that treats inflammation and pain source. Pathways associated with Valdecoxcib could’ve shown up potentially because it might’ve been added to the ACT treatment (to treat pain and inflammation) . I was not able to find any literature to support this.

Pathway/Theme to Investigate More Closely

The pathway that I chose to investigate in more detail was the Mefenamic Acid Action Pathway. I decided to choose this pathway as it is found in one of the most prevalent themes in our network above, and also plays a role in treating inflammation. Thus, I wanted to investigate it in a bit more detail. Below, I used STRING to generate the pathway as a gene network. I was unable to, however annotate the network or pathway with my log fold expression values and p-values, because every time I tried to import my rank file, and select gene names as the parameter, Cytoscape wouldn’t generate the correct output. I have explained this issue in my journal as well.

STRING Pathway Up Close
STRING Pathway Up Close

References

Morgan M, Ramos M (2023). BiocManager: Access the Bioconductor Project Package Repository. R package version 1.30.22, https://CRAN.R-project.org/package=BiocManager.

Davis, S. and Meltzer, P. S. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 2007, 14, 1846-1847

Xie Y (2023). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.45, https://yihui.org/knitr/.

Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140

Response to tumor-infiltrating lymphocyte adoptive therapy is associated with preexisting CD8+ T-myeloid cell … (n.d.). https://www.science.org/doi/10.1126/sciimmunol.adg7995

Gu Z, Eils R, Schlesner M (2016). “Complex heatmaps reveal patterns and correlations in multidimensional genomic data.” Bioinformatics. doi:10.1093/bioinformatics/btw313

Gu Z (2022). “Complex Heatmap Visualization.” iMeta. doi:10.1002/imt2.43.

Kolberg L, Raudvere U, Kuzmin I, Vilo J, Peterson H (2020). “gprofiler2– an R package for gene list functional enrichment analysis and namespace conversion toolset
g:Profiler.” F1000Research, 9 (ELIXIR)(709). R package version 0.2.3.

Seidel, C. S. C. (n.d.). Intro to Edger. https://research.stowers.org/cws/CompGenomics/Projects/edgeR.html

Shi D, Jiang P. A Different Facet of p53 Function: Regulation of Immunity and Inflammation During Tumor Development. Front Cell Dev Biol. 2021 Oct 18;9:762651. doi:10.3389/fcell.2021.762651. PMID: 34733856; PMCID: PMC8558413.

Agrawal a, et al. (2024) WikiPathways 2024: next generation pathway database. NAR.

Milacic M, Beavers D, Conley P, Gong C, Gillespie M, Griss J, Haw R, Jassal B, Matthews L, May B, Petryszak R, Ragueneau E, Rothfels K, Sevilla C, Shamovsky V, Stephan R, Tiwari K, Varusai T, Weiser J, Wright A, Wu G, Stein L, Hermjakob H, D’Eustachio P. The Reactome Pathway Knowledgebase 2024. Nucleic Acids Research. 2024. doi: 10.1093/nar/gkad1025.

Ashburner et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000 May;25(1):25-9. DOI: 10.1038/75556

Cytoscape App Store - enrichmentmap. (n.d.). https://apps.cytoscape.org/apps/enrichmentmap

Gary Bader, R. I. (n.d.). Pathway and network analysis of -OMICS data ( June 2023 ). Module 3 Lab: GSEA Visualization.https://baderlab.github.io/CBW_Pathways_2023/gsea_mod3.html

“Valdecoxib (Oral Route) Description and Brand Names.” Mayo Clinic, Mayo Foundation for Medical Education and Research, 1 Feb. 2024, www.mayoclinic.org/drugs-supplements/valdecoxib-oral-route/description/drg-20066654.